Case Study - MovieLens

To show pandas in a more "applied" sense, let's use it to answer some questions about the MovieLens dataset.
The dataset contains 100,000 ratings made by 943 users on 1,682 movies.

The MovieLens data is a good example for this because it has a lot of inter-ralationships that will requiere joining/grouping :

the datasets users and ratings are linked together by a key (in this case, the user_id and movie_id).
a rating requires both a user and a movie.
a user can be associated with zero or many ratings and movies.
a movie can be rated zero or many times, by a number of different users.



In [2]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('max_columns', 50)



In [3]:

    
# pass in column names for each CSV
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('data/ml-100k/u.user', sep='|', names=u_cols)

r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('data/ml-100k/u.data', sep='\t', names=r_cols)

# the movies file contains columns indicating the movie's genres
# let's only load the first five columns of the file with usecols
m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
movies = pd.read_csv('data/ml-100k/u.item', sep='|', names=m_cols, usecols=range(5))

# create one merged DataFrame
movie_ratings = pd.merge(movies, ratings)
lens = pd.merge(movie_ratings, users)

Summary

What are the 10 most rated movies ?
Which movies are most controversial amongst different age ?



In [5]:

    
print movies.head(3), '\n'
print ratings.head(3), '\n'
print users.head(3)









    



   movie_id              title release_date  video_release_date  \
0         1   Toy Story (1995)  01-Jan-1995                 NaN   
1         2   GoldenEye (1995)  01-Jan-1995                 NaN   
2         3  Four Rooms (1995)  01-Jan-1995                 NaN   

                                            imdb_url  
0  http://us.imdb.com/M/title-exact?Toy%20Story%2...  
1  http://us.imdb.com/M/title-exact?GoldenEye%20(...  
2  http://us.imdb.com/M/title-exact?Four%20Rooms%...   

   user_id  movie_id  rating  unix_timestamp
0      196       242       3       881250949
1      186       302       3       891717742
2       22       377       1       878887116 

   user_id  age sex  occupation zip_code
0        1   24   M  technician    85711
1        2   53   F       other    94043
2        3   23   M      writer    32067 

   movie_id              title release_date  video_release_date  \
0         1   Toy Story (1995)  01-Jan-1995                 NaN   
1         4  Get Shorty (1995)  01-Jan-1995                 NaN   
2         5     Copycat (1995)  01-Jan-1995                 NaN   

                                            imdb_url  user_id  rating  \
0  http://us.imdb.com/M/title-exact?Toy%20Story%2...      308       4   
1  http://us.imdb.com/M/title-exact?Get%20Shorty%...      308       5   
2  http://us.imdb.com/M/title-exact?Copycat%20(1995)      308       4   

   unix_timestamp  age sex occupation zip_code  
0       887736532   60   M    retired    95076  
1       887737890   60   M    retired    95076  
2       887739608   60   M    retired    95076



In [6]:

    
print lens.head()









    



   movie_id                  title release_date  video_release_date  \
0         1       Toy Story (1995)  01-Jan-1995                 NaN   
1         4      Get Shorty (1995)  01-Jan-1995                 NaN   
2         5         Copycat (1995)  01-Jan-1995                 NaN   
3         7  Twelve Monkeys (1995)  01-Jan-1995                 NaN   
4         8            Babe (1995)  01-Jan-1995                 NaN   

                                            imdb_url  user_id  rating  \
0  http://us.imdb.com/M/title-exact?Toy%20Story%2...      308       4   
1  http://us.imdb.com/M/title-exact?Get%20Shorty%...      308       5   
2  http://us.imdb.com/M/title-exact?Copycat%20(1995)      308       4   
3  http://us.imdb.com/M/title-exact?Twelve%20Monk...      308       4   
4     http://us.imdb.com/M/title-exact?Babe%20(1995)      308       5   

   unix_timestamp  age sex occupation zip_code  
0       887736532   60   M    retired    95076  
1       887737890   60   M    retired    95076  
2       887739608   60   M    retired    95076  
3       887738847   60   M    retired    95076  
4       887736696   60   M    retired    95076



In [9]:

    
lens.describe()









    Out[9]:






  
    
      
      movie_id
      video_release_date
      user_id
      rating
      unix_timestamp
      age
    
  
  
    
      count
      100000.000000
      0
      100000.00000
      100000.000000
      1.000000e+05
      100000.000000
    
    
      mean
      425.530130
      NaN
      462.48475
      3.529860
      8.835289e+08
      32.969850
    
    
      std
      330.798356
      NaN
      266.61442
      1.125674
      5.343856e+06
      11.562623
    
    
      min
      1.000000
      NaN
      1.00000
      1.000000
      8.747247e+08
      7.000000
    
    
      25%
      175.000000
      NaN
      254.00000
      3.000000
      8.794487e+08
      24.000000
    
    
      50%
      322.000000
      NaN
      447.00000
      4.000000
      8.828269e+08
      30.000000
    
    
      75%
      631.000000
      NaN
      682.00000
      4.000000
      8.882600e+08
      40.000000
    
    
      max
      1682.000000
      NaN
      943.00000
      5.000000
      8.932866e+08
      73.000000

What are the 10 most rated movies ?

Split the DataFrame into groups by movie title
Apply the size method to get the count of records in each group
Order our results in descending order
Limit the output to the top 10 using Python's slicing syntax.

using .groupby()



In [5]:

    
most_rated = lens.groupby('title').size().order(ascending=False)[:10]
print most_rated









    



title
Star Wars (1977)                 583
Contact (1997)                   509
Fargo (1996)                     508
Return of the Jedi (1983)        507
Liar Liar (1997)                 485
English Patient, The (1996)      481
Scream (1996)                    478
Toy Story (1995)                 452
Air Force One (1997)             431
Independence Day (ID4) (1996)    429
dtype: int64

using .value_count()



In [7]:

    
# using .value_count()
lens.title.value_counts()[:10]









    Out[7]:





Star Wars (1977)                 583
Contact (1997)                   509
Fargo (1996)                     508
Return of the Jedi (1983)        507
Liar Liar (1997)                 485
English Patient, The (1996)      481
Scream (1996)                    478
Toy Story (1995)                 452
Air Force One (1997)             431
Independence Day (ID4) (1996)    429
dtype: int64

Which movies are most highly rated ?

Split the DataFrame into groups by movie title
Apply the size and mean to each group using .agg()method
Order the results by average rating in descending order
Keep only movies with at least 100 ratings

We can use the .agg() method to pass a dictionary specifying the columns to aggregate (as keys) and a list of functions we'd like to apply.



In [25]:

    
movie_stats = lens.groupby('title').agg({'rating': [np.size, np.mean]})
print movie_stats.head()









    



                          rating          
                            size      mean
title                                     
'Til There Was You (1997)      9  2.333333
1-900 (1994)                   5  2.600000
101 Dalmatians (1996)        109  2.908257
12 Angry Men (1957)          125  4.344000
187 (1997)                    41  3.024390

Let's sort the resulting DataFrame so that we can see which movies have the highest average score.
Because movie_stats is a DataFrame, we use the sort method (only Series objects use order).
Additionally, because our columns are now a MultiIndex, we need to pass in a tuple specifying how to sort.



In [26]:

    
print movie_stats.sort([('rating', 'mean')], ascending=False).head()









    



                                           rating     
                                             size mean
title                                                 
They Made Me a Criminal (1939)                  1    5
Marlene Dietrich: Shadow and Light (1996)       1    5
Saint of Fort Washington, The (1993)            2    5
Someone Else's America (1995)                   1    5
Star Kid (1997)                                 3    5

The above movies are rated so rarely that we can't count them as quality films. Let's only look at movies that have been rated at least 100 times.



In [55]:

    
atleast_100 = movie_stats['rating']['size'] > 100
print movie_stats[atleast_100].sort([('rating', 'mean')], ascending=False).head()









    



                                 rating          
                                   size      mean
title                                            
Close Shave, A (1995)               112  4.491071
Schindler's List (1993)             298  4.466443
Wrong Trousers, The (1993)          118  4.466102
Casablanca (1942)                   243  4.456790
Shawshank Redemption, The (1994)    283  4.445230

Which movies are most controversial amongst different age ?

Bin users into age groups using pandas.cut()
Split the DataFrame into groups by age groups
Apply the size and mean to each group using .agg()method
Order the results by average rating in descending order

Limiting our population going forward
Going forward, let's only look at the 50 most rated movies. Let's make a Series of movies that meet this threshold so we can use it for filtering later.



In [8]:

    
most_50 = lens.groupby('movie_id').size().order(ascending=False)[:50]

Let's look at how these movies are viewed across different age groups. First, let's look at how age is distributed amongst our users.



In [9]:

    
users.age.hist(bins=30)
plt.title("Distribution of users' ages")
plt.ylabel('count of users')
plt.xlabel('age');



In [10]:

    
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
bins = range(0, 81, 10) # [0, 10, 20, 30, 40, 50, 60, 70, 80]
lens['age_group'] = pd.cut(lens.age, bins, right=False, labels=labels)
print lens[['age', 'age_group']].drop_duplicates()[:5] # preview of age bin









    



     age age_group
0     60     60-69
397   21     20-29
459   33     30-39
524   30     30-39
782   23     20-29

Now we can now compare ratings across age groups.



In [62]:

    
print lens.groupby('age_group').agg({'rating': [np.size, np.mean]})









    



          rating          
            size      mean
age_group                 
0-9           43  3.767442
10-19       8181  3.486126
20-29      39535  3.467333
30-39      25696  3.554444
40-49      15021  3.591772
50-59       8704  3.635800
60-69       2623  3.648875
70-79        197  3.649746

Young users seem a bit more critical than other age groups. Let's look at how the 50 most rated movies are viewed across each age group. We can use the most_50 Series we created earlier for filtering.



In [63]:

    
lens.set_index('movie_id', inplace=True)



In [64]:

    
by_age = lens.ix[most_50.index].groupby(['title', 'age_group'])
by_age.rating.mean().head(15)









    Out[64]:





title                 age_group
Air Force One (1997)  10-19        3.647059
                      20-29        3.666667
                      30-39        3.570000
                      40-49        3.555556
                      50-59        3.750000
                      60-69        3.666667
                      70-79        3.666667
Alien (1979)          10-19        4.111111
                      20-29        4.026087
                      30-39        4.103448
                      40-49        3.833333
                      50-59        4.272727
                      60-69        3.500000
                      70-79        4.000000
Aliens (1986)         10-19        4.050000
Name: rating, dtype: float64

Notice that both the title and age group are indexes here, with the average rating value being a Series. This is going to produce a really long list of values.

Wouldn't it be nice to see the data as a table? Each title as a row, each age group as a column, and the average rating in each cell.

Behold! The magic of unstack!



In [66]:

    
by_age.rating.mean().unstack(1)[10:20]









    Out[66]:






  
    
      age_group
      0-9
      10-19
      20-29
      30-39
      40-49
      50-59
      60-69
      70-79
    
    
      title
      
      
      
      
      
      
      
      
    
  
  
    
      E.T. the Extra-Terrestrial (1982)
      NaN
      3.680000
      3.609091
      3.806818
      4.160000
      4.368421
      4.375000
      NaN
    
    
      Empire Strikes Back, The (1980)
      4
      4.642857
      4.311688
      4.052083
      4.100000
      3.909091
      4.250000
      5.000000
    
    
      English Patient, The (1996)
      5
      3.739130
      3.571429
      3.621849
      3.634615
      3.774648
      3.904762
      4.500000
    
    
      Fargo (1996)
      NaN
      3.937500
      4.010471
      4.230769
      4.294118
      4.442308
      4.000000
      4.333333
    
    
      Forrest Gump (1994)
      5
      4.047619
      3.785714
      3.861702
      3.847826
      4.000000
      3.800000
      NaN
    
    
      Fugitive, The (1993)
      NaN
      4.320000
      3.969925
      3.981481
      4.190476
      4.240000
      3.666667
      NaN
    
    
      Full Monty, The (1997)
      NaN
      3.421053
      4.056818
      3.933333
      3.714286
      4.146341
      4.166667
      3.500000
    
    
      Godfather, The (1972)
      NaN
      4.400000
      4.345070
      4.412844
      3.929412
      4.463415
      4.125000
      NaN
    
    
      Groundhog Day (1993)
      NaN
      3.476190
      3.798246
      3.786667
      3.851064
      3.571429
      3.571429
      4.000000
    
    
      Independence Day (ID4) (1996)
      NaN
      3.595238
      3.291429
      3.389381
      3.718750
      3.888889
      2.750000
      NaN

unstack, well, unstacks the specified level of a MultiIndex (by default, groupby turns the grouped field into an index - since we grouped by two fields, it became a MultiIndex). We unstacked the second index (remember that Python uses 0-based indexes), and then filled in NULL values with 0.

If we would have used:

by_age.rating.mean().unstack(0).fillna(0)

We would have had our age groups as rows and movie titles as columns.

Which movies do men and women most disagree on?

DataFrame's have a pivot_table method that makes these kinds of operations much easier (and less verbose).



In [67]:

    
lens.reset_index('movie_id', inplace=True)



In [78]:

    
pivoted = pd.pivot_table(lens, values='rating', index=['movie_id', 'title'],
...                     columns=['sex'], fill_value=0)
print pivoted.head()









    



sex                                F         M
movie_id title                                
1        Toy Story (1995)   3.789916  3.909910
2        GoldenEye (1995)   3.368421  3.178571
3        Four Rooms (1995)  2.687500  3.108108
4        Get Shorty (1995)  3.400000  3.591463
5        Copycat (1995)     3.772727  3.140625



In [79]:

    
pivoted['diff'] = pivoted.M - pivoted.F
print pivoted.head()









    



sex                                F         M      diff
movie_id title                                          
1        Toy Story (1995)   3.789916  3.909910  0.119994
2        GoldenEye (1995)   3.368421  3.178571 -0.189850
3        Four Rooms (1995)  2.687500  3.108108  0.420608
4        Get Shorty (1995)  3.400000  3.591463  0.191463
5        Copycat (1995)     3.772727  3.140625 -0.632102



In [80]:

    
pivoted.reset_index('movie_id', inplace=True)



In [81]:

    
disagreements = pivoted[pivoted.movie_id.isin(most_50.index)]['diff']
disagreements.order().plot(kind='barh', figsize=[9, 15])
plt.title('Male vs. Female Avg. Ratings\n(Difference > 0 = Favored by Men)')
plt.ylabel('Title')
plt.xlabel('Average Rating Difference');

	movie_id	video_release_date	user_id	rating	unix_timestamp	age
count	100000.000000	0	100000.00000	100000.000000	1.000000e+05	100000.000000
mean	425.530130	NaN	462.48475	3.529860	8.835289e+08	32.969850
std	330.798356	NaN	266.61442	1.125674	5.343856e+06	11.562623
min	1.000000	NaN	1.00000	1.000000	8.747247e+08	7.000000
25%	175.000000	NaN	254.00000	3.000000	8.794487e+08	24.000000
50%	322.000000	NaN	447.00000	4.000000	8.828269e+08	30.000000
75%	631.000000	NaN	682.00000	4.000000	8.882600e+08	40.000000
max	1682.000000	NaN	943.00000	5.000000	8.932866e+08	73.000000

age_group	0-9	10-19	20-29	30-39	40-49	50-59	60-69	70-79
title
E.T. the Extra-Terrestrial (1982)	NaN	3.680000	3.609091	3.806818	4.160000	4.368421	4.375000	NaN
Empire Strikes Back, The (1980)	4	4.642857	4.311688	4.052083	4.100000	3.909091	4.250000	5.000000
English Patient, The (1996)	5	3.739130	3.571429	3.621849	3.634615	3.774648	3.904762	4.500000
Fargo (1996)	NaN	3.937500	4.010471	4.230769	4.294118	4.442308	4.000000	4.333333
Forrest Gump (1994)	5	4.047619	3.785714	3.861702	3.847826	4.000000	3.800000	NaN
Fugitive, The (1993)	NaN	4.320000	3.969925	3.981481	4.190476	4.240000	3.666667	NaN
Full Monty, The (1997)	NaN	3.421053	4.056818	3.933333	3.714286	4.146341	4.166667	3.500000
Godfather, The (1972)	NaN	4.400000	4.345070	4.412844	3.929412	4.463415	4.125000	NaN
Groundhog Day (1993)	NaN	3.476190	3.798246	3.786667	3.851064	3.571429	3.571429	4.000000
Independence Day (ID4) (1996)	NaN	3.595238	3.291429	3.389381	3.718750	3.888889	2.750000	NaN